{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a href=\"https://colab.research.google.com/github/jeffheaton/app_generative_ai/blob/main/t81_559_class_09_1_image_genai.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# T81-559: Applications of Generative Artificial Intelligence\n",
    "**Module 9: MultiModal and Text to Image Models**\n",
    "* Instructor: [Jeff Heaton](https://sites.wustl.edu/jeffheaton/), McKelvey School of Engineering, [Washington University in St. Louis](https://engineering.wustl.edu/Programs/Pages/default.aspx)\n",
    "* For more information visit the [class website](https://sites.wustl.edu/jeffheaton/t81-558/)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Module 9 Material\n",
    "\n",
    "Module 9: MultiModal and Text to Image\n",
    "\n",
    "* **Part 9.1: Introduction to MultiModal and Text to Image** [[Video]](https://www.youtube.com/watch?v=lcUsade04pg&ab_channel=JeffHeaton) [[Notebook]](t81_559_class_09_1_image_genai.ipynb)\n",
    "* Part 9.2: Generating Images with DALL·E Kaggle Notebooks [[Video]](https://www.youtube.com/watch?v=CBfT1y1V1e0&ab_channel=JeffHeaton) [[Notebook]](t81_559_class_09_2_dalle.ipynb)\n",
    "* Part 9.3: DALL·E Existing Images [[Video]](https://youtube.com/watch?v=5gdaXrJs3Kk&ab_channel=JeffHeaton) [[Notebook]](t81_559_class_09_3_dalle_existing.ipynb)\n",
    "* Part 9.4: MultiModal Models [[Video]](https://www.youtube.com/watch?v=rYlj9t_wlFA&ab_channel=JeffHeaton) [[Notebook]](t81_559_class_09_4_multimodal.ipynb)\n",
    "* Part 9.5: Illustrated Book [[Video]](https://www.youtube.com/watch?v=TTGen7P3ScU&ab_channel=JeffHeaton) [[Notebook]](t81_559_class_09_5_illustrated_book.ipynb)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Part 9.1: MultiModal and Text to Image Models\n",
    "\n",
    "In this module, we will explore text-to-image models and how they serve as a foundation for multimodal models. Text-to-image models represent a significant step forward in AI, transforming natural language into visual representations. These models, such as DALL·E and Stable Diffusion, leverage deep learning techniques to generate highly detailed images based on textual descriptions. They operate by mapping the relationships between words and visual elements, allowing the model to \"understand\" how language can describe scenes, objects, and concepts, and then generate corresponding images.\n",
    "\n",
    "Multimodal models, on the other hand, take this concept further by integrating multiple forms of data—typically text, images, and sometimes even audio or video. These models are designed to handle and correlate different types of information simultaneously. For example, multimodal systems can generate text descriptions of images, answer questions about a scene (visual question answering), or even generate videos based on a sequence of text inputs. By combining the capabilities of both text and image models, multimodal systems expand the scope of AI applications, enabling more dynamic and context-aware interactions.\n",
    "\n",
    "While text-to-image models focus on synthesizing visual content from language, multimodal models build on this by allowing for richer, more complex forms of interaction between different types of media. This module will examine how text-to-image models work, how they are trained, and how their underlying architectures influence the development of broader multimodal systems. We will also cover use cases and explore the future possibilities that arise from combining various data modalities.\n",
    "\n",
    "## Text to Image Models\n",
    "\n",
    "In this section, we will dive into text-to-image models, a cutting-edge area of artificial intelligence that allows machines to generate images based on textual descriptions. These models bridge the gap between language and visual content, translating a natural language prompt into a fully realized image. Text-to-image generation is powered by deep learning techniques, specifically a combination of natural language processing (NLP) and computer vision.\n",
    "\n",
     "The key innovation behind text-to-image models is their ability to understand the relationships between words and visual elements. By training on large datasets that pair text with corresponding images, these models learn how to map language inputs, such as \"a cat sitting on a windowsill,\" to relevant visual features like shapes, textures, colors, and spatial arrangements. Popular models such as DALL·E, Midjourney, and Stable Diffusion have showcased the creative potential of this technology by generating photorealistic or artistic images directly from descriptive text.\n",
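     "\n",
     "That whole text-to-image mapping is typically packaged behind a single pipeline call. Below is a minimal, hedged sketch using the Hugging Face `diffusers` library; it assumes `diffusers` and `torch` are installed and a CUDA GPU is available, and the checkpoint name is just one common choice, not a course requirement:\n",
     "\n",
     "```python\n",
     "def render_prompt(prompt: str):\n",
     "    \"\"\"Generate an image from a text prompt with Stable Diffusion (sketch only).\"\"\"\n",
     "    import torch\n",
     "    from diffusers import StableDiffusionPipeline  # lazy import keeps the sketch self-contained\n",
     "\n",
     "    pipe = StableDiffusionPipeline.from_pretrained(\n",
     "        \"runwayml/stable-diffusion-v1-5\", torch_dtype=torch.float16\n",
     "    ).to(\"cuda\")\n",
     "    # The pipeline encodes the prompt, runs the denoising loop, and decodes an image.\n",
     "    return pipe(prompt).images[0]\n",
     "\n",
     "# render_prompt(\"a cat sitting on a windowsill\")  # returns a PIL.Image\n",
     "```\n",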
    "\n",
     "Here you can see Stable Diffusion render a cat on a windowsill:\n",
    "\n",
    "![Stable Diffusion Renders a Cat](https://data.heatonresearch.com/images/wustl/app_genai/cat_window_stable_diff.jpg)\n",
    "\n",
     "DALL·E takes this one step further by using an LLM to expand your prompt; this expansion also lets DALL·E ensure your prompt is \"safe\" and does not create inappropriate images. Here you can see what DALL·E extends the prompt to:\n",
    "\n",
    "```\n",
    "A cozy scene of a cat sitting on a windowsill, with soft natural light streaming in from outside. The cat is sitting gracefully, looking outside with its tail curled around its paws. The windowsill is decorated with a few potted plants, and the outside view shows a peaceful garden with green foliage. The room inside is warm and inviting, with subtle shadows created by the light. The cat has sleek fur and a relaxed posture.\n",
    "```\n",
    "\n",
    "The above prompt produces this image:\n",
    "\n",
     "![DALL·E Renders a Cat](https://data.heatonresearch.com/images/wustl/app_genai/cat_window_dalle.jpg)\n",
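     "\n",
     "This flow can be sketched against the `openai` Python package (v1+). This is a hedged sketch, not this notebook's own code; it assumes `pip install openai` and an `OPENAI_API_KEY` in the environment, and the response's `revised_prompt` field is where the LLM expansion comes back:\n",
     "\n",
     "```python\n",
     "def generate_with_dalle(prompt: str):\n",
     "    \"\"\"Request a DALL-E 3 image; return (image_url, revised_prompt).\"\"\"\n",
     "    from openai import OpenAI  # lazy import: requires `pip install openai`\n",
     "\n",
     "    client = OpenAI()  # reads OPENAI_API_KEY from the environment\n",
     "    resp = client.images.generate(\n",
     "        model=\"dall-e-3\", prompt=prompt, size=\"1024x1024\", n=1\n",
     "    )\n",
     "    image = resp.data[0]\n",
     "    # DALL-E 3 rewrites the prompt before rendering; the expansion is returned too.\n",
     "    return image.url, image.revised_prompt\n",
     "```\n",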
    "\n",
    "In this module, we will explore the architecture behind these models, including transformers, generative adversarial networks (GANs), and diffusion models, and examine how they capture the relationship between text and images. Additionally, we will discuss the applications of text-to-image models in fields such as art, design, advertising, and content generation, as well as the ethical considerations that come with such powerful technology. By the end of this module, you will understand how text-to-image models work, how they are trained, and their transformative impact on both AI research and creative industries.\n",
    "\n",
    "## MultiModal Models\n",
    "\n",
    "In this section, we will explore the fascinating world of multimodal models, which are designed to process and integrate different types of data—such as text, images, audio, and video—into a single, unified understanding. These models represent a significant leap in AI, allowing machines to perform more complex tasks by combining inputs from multiple sources, much like humans do when perceiving and interacting with the world.\n",
    "\n",
    "Multimodal models are built on the foundation of specialized models like text-to-image systems, but they go further by enabling AI to analyze and understand relationships between diverse forms of information. A multimodal system, for example, can generate text from images (image captioning), answer questions about visual content (visual question answering), or even generate images based on both text and other media inputs.\n",
    "\n",
    "An exciting real-world example of multimodal capabilities is the ability to analyze a hand-drawn tic-tac-toe board and answer questions about it. For instance, you could show the model an image of a hand-drawn tic-tac-toe game and ask, \"Who won this game?\" A multimodal model could interpret the image, identify the placement of X’s and O’s, and determine the winner based on the game's rules, all without requiring any additional text-based input about the board's layout.\n",
    "\n",
    "![Tic Tac Toe](https://data.heatonresearch.com/images/wustl/app_genai/tictactoe.jpg)\n",
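     "\n",
     "The game-logic half of that question is simple enough to sketch in plain Python; the hard part a multimodal model supplies is reading the X and O placements out of the photo. The three-string board encoding below is an assumption for illustration:\n",
     "\n",
     "```python\n",
     "def winner(board):\n",
     "    \"\"\"Return 'X', 'O', or None for a 3x3 board given as three strings.\"\"\"\n",
     "    lines = list(board)  # rows\n",
     "    lines += [\"\".join(row[c] for row in board) for c in range(3)]  # columns\n",
     "    lines.append(\"\".join(board[i][i] for i in range(3)))  # main diagonal\n",
     "    lines.append(\"\".join(board[i][2 - i] for i in range(3)))  # anti-diagonal\n",
     "    for line in lines:\n",
     "        if line in (\"XXX\", \"OOO\"):\n",
     "            return line[0]\n",
     "    return None\n",
     "\n",
     "print(winner([\"XO.\", \"OX.\", \"..X\"]))  # -> X (wins on the main diagonal)\n",
     "```\n",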
    "\n",
    "This integration of multiple data types allows for more sophisticated and intuitive interactions between humans and AI, making multimodal models a powerful tool for a wide range of applications, from healthcare and education to entertainment and design. In this module, we will explore the key architectures that enable these models, such as Vision-Language Transformers and CLIP (Contrastive Language-Image Pre-training), and discuss their applications, advantages, and potential future developments. You will gain an understanding of how multimodal models are trained, how they build on text-to-image models, and how they are changing the landscape of AI-powered solutions.\n"
   ]
  },
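  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "CLIP's core idea of a shared text-image embedding space can be illustrated with a toy example. The embedding numbers below are made up for illustration; a real CLIP model learns its vectors (with hundreds of dimensions) from massive sets of text-image pairs. Matched pairs should score highest under cosine similarity:\n",
    "\n",
    "```python\n",
    "import math\n",
    "\n",
    "def cosine(u, v):\n",
    "    \"\"\"Cosine similarity between two vectors.\"\"\"\n",
    "    dot = sum(a * b for a, b in zip(u, v))\n",
    "    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))\n",
    "\n",
    "# Made-up 3-dimensional embeddings standing in for real CLIP vectors.\n",
    "text_emb = {\"a cat\": [0.9, 0.1, 0.2], \"a dog\": [0.1, 0.9, 0.3]}\n",
    "image_emb = {\"cat_photo\": [0.8, 0.2, 0.1], \"dog_photo\": [0.2, 0.8, 0.4]}\n",
    "\n",
    "for text, t in text_emb.items():\n",
    "    best = max(image_emb, key=lambda name: cosine(t, image_emb[name]))\n",
    "    print(text, \"->\", best)  # each caption pairs with its matching photo\n",
    "```\n",
    "\n",
    "Contrastive training pushes matching text-image pairs together and mismatched pairs apart in exactly this kind of space, which is what lets a multimodal model retrieve or describe images it has never seen captions for.\n"
   ]
  },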
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Module 9 Assignment\n",
    "\n",
    "You can find the first assignment here: [assignment 9](https://github.com/jeffheaton/app_generative_ai/blob/main/assignments/assignment_yourname_class9.ipynb)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "anaconda-cloud": {},
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
